Economical Inversion of Large Text Files
Author
Abstract
To provide keyword-based access to a large text file it is usually necessary to invert the file and create an inverted index that stores, for each word in the file, the paragraph or sentence numbers in which that word occurs. Inverting a large file using traditional techniques may take as much temporary disk space as is occupied by the file itself, and consume a great deal of CPU time. Here we describe an alternative technique for inverting large text files that requires only a nominal amount of temporary disk storage, instead building the inverted index in compressed form in main memory. A program implementing this approach has created a paragraph-level index of a 132 Mbyte collection of legal documents using 13 Mbyte of main memory, 500 Kbyte of temporary disk storage, and approximately 45 CPU-minutes on a Sun SPARCstation 2.
© Computing Systems, Vol. 5, No. 2, Spring 1992, p. 125
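The idea of building the index in compressed form in main memory can be illustrated with a minimal sketch. The abstract does not say which coding scheme the paper uses, so the sketch below assumes a simple variable-byte code applied to the gaps between successive paragraph numbers; the function names (`encode_vbyte`, `decode_vbytes`, `invert`, `lookup`) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def encode_vbyte(n):
    """Encode a positive integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)   # continuation byte: high bit clear
        n >>= 7
    out.append(n | 0x80)       # final byte of the number: high bit set
    return bytes(out)


def decode_vbytes(data):
    """Decode a byte string holding a sequence of variable-byte integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:        # final byte: finish the current number
            values.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:                  # continuation byte: accumulate 7 more bits
            n |= byte << shift
            shift += 7
    return values


def invert(paragraphs):
    """Build an in-memory index: word -> compressed list of paragraph gaps.

    Assumption: words are whitespace-separated and case-folded; a real
    system would use a proper tokenizer.
    """
    postings = defaultdict(bytearray)
    last_seen = {}             # word -> last paragraph number recorded
    for number, text in enumerate(paragraphs, start=1):
        for word in set(text.lower().split()):
            gap = number - last_seen.get(word, 0)
            postings[word] += encode_vbyte(gap)
            last_seen[word] = number
    return postings


def lookup(postings, word):
    """Return the paragraph numbers containing word, by summing the gaps."""
    numbers, total = [], 0
    for gap in decode_vbytes(bytes(postings.get(word, b""))):
        total += gap
        numbers.append(total)
    return numbers
```

Storing gaps rather than absolute paragraph numbers keeps most values small, so a frequent word costs roughly one byte per occurrence in this sketch. For example, `lookup(invert(["the cat sat", "a dog", "the dog barked"]), "the")` recovers the paragraph numbers 1 and 3.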
Similar resources
Efficient single-pass index construction for text databases
Efficient construction of inverted indexes is essential to provision of search over large collections of text data. In this article, we review the principal approaches to inversion, analyze their theoretical cost, and present experimental results. We identify the drawbacks of existing inversion approaches and propose a single-pass inversion method that, in contrast to previous approaches, does ...
Compression and Fast Indexing for Multi-Gigabyte Text Databases
In the last two years we have developed improved techniques for indexing and retrieval of text data, including algorithms for inversion, for compression of the data and index, and for economical ranking. These techniques were, however, tested on relatively small databases. In this paper we describe our experiences in scaling these techniques up to a large (2 Gb) heterogeneous text database. Our...
Phase Inversion in a Batch Liquid–Liquid Stirred System
Phase inversion phenomenon occurs in many industrial processes including liquid-liquid dispersions. Some parameters such as energy input or the presence of mineral compounds in the system affect this phen...
Large-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation
In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...
A Superimposed Coding Scheme Based on Multiple Block Descriptor Files for Indexing Very Large Data Bases
A new signature file method for accessing information from large data files containing both formatted and free text data is presented. The new method, called the multiorganizational scheme, is proposed for indexing very large data files containing hundreds of thousands or possibly millions of records.
Journal title:
- Computing Systems
Volume 5, Issue
Pages -
Publication date 1992